Updated spark-quickstart instructions to use Ozone as a backend store #15034

Open
SaketaChalamchala wants to merge 1 commit into apache:main from SaketaChalamchala:iceberg-ozone-support

Conversation

@SaketaChalamchala
Contributor

Following the discussion in thread #14945, submitting a patch to use Ozone as a backend store for Iceberg in the quickstart Docker environments.
Tested manually with spark-shell.
This patch will need to be merged after databricks/docker-spark-iceberg#246, which also contains updates to spark-defaults.conf.


Copilot AI left a comment


Pull request overview

This pull request replaces MinIO with Apache Ozone as the backend storage for the Spark Quickstart Docker environment, addressing concerns about MinIO's maintenance mode and security issues discussed in issue #14945.

Changes:

  • Replaced MinIO service with Apache Ozone multi-service architecture (SCM, OM, Datanode, Recon, S3 Gateway)
  • Updated S3 endpoint configuration to point to the Ozone S3 Gateway (s3.ozone:9878); see the sketch after this list
  • Added CREATE NAMESPACE statement to SQL example for proper namespace initialization
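
For reference, a minimal sketch of what the endpoint change could look like for the catalog service in docker-compose.yml. This assumes the quickstart's existing rest service and its CATALOG_* environment variables; only the endpoint value comes from this PR's overview (s3.ozone:9878), the rest is illustrative:

  rest:
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
      - CATALOG_WAREHOUSE=s3://warehouse/
      # MinIO endpoint replaced with the Ozone S3 Gateway alias defined in this compose file
      - CATALOG_S3_ENDPOINT=http://s3.ozone:9878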


- MINIO_ROOT_USER=admin
- MINIO_ROOT_PASSWORD=password
- MINIO_DOMAIN=minio
<<: *ozone-common-config

Copilot AI Jan 12, 2026


The ozone-datanode service is missing a depends_on clause to ensure proper startup order. The datanode requires both ozone-scm (Storage Container Manager) and ozone-om (Ozone Manager) to be running before it can start successfully. Without this dependency, the datanode may fail to start or may run into startup issues.

Suggested change
-  <<: *ozone-common-config
+  <<: *ozone-common-config
+  depends_on:
+    - ozone-scm
+    - ozone-om

ports:
- 9888:9888
environment:
<<: *ozone-common-config

Copilot AI Jan 12, 2026


The ozone-recon service should have a depends_on clause to ensure the ozone-scm and ozone-om services are available before it starts. Recon is the monitoring and management service for Ozone and needs the other core services to be operational to function properly.

Suggested change
-  <<: *ozone-common-config
+  <<: *ozone-common-config
+  depends_on:
+    - ozone-scm
+    - ozone-om

Comment on lines +144 to +165
  ozone-s3g:
    <<: *ozone-image
    ports:
      - 9878:9878
    environment:
      <<: *ozone-common-config
    command:
      - sh
      - -c
      - |
        set -e
        ozone s3g &
        s3g_pid=$$!
        until ozone sh volume list >/dev/null 2>&1; do echo '...waiting...' && sleep 1; done;
        ozone sh bucket delete /s3v/warehouse || true
        ozone sh bucket create /s3v/warehouse
        wait "$$s3g_pid"
    networks:
      iceberg_net:
        aliases:
          - s3.ozone
          - warehouse.s3.ozone

Copilot AI Jan 12, 2026


The ozone-s3g service is missing a depends_on clause to ensure the prerequisite services are started first. The S3 Gateway requires ozone-scm, ozone-om, and ozone-datanode to be available before it can successfully handle S3 requests. While the shell script waits for ozone-om to respond, the depends_on clause is still needed to prevent the container from starting prematurely and ensure proper orchestration.
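
A minimal sketch of the dependency declaration this comment asks for, using the service names from this compose file (indentation assumed):

  ozone-s3g:
    <<: *ozone-image
    depends_on:
      - ozone-scm
      - ozone-om
      - ozone-datanode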

ports:
- 9878:9878
environment:
<<: *ozone-common-config

Copilot AI Jan 12, 2026


The Ozone S3 Gateway (ozone-s3g) configuration is missing AWS credential environment variables, while the spark-iceberg and rest services are configured with AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. For the S3 API to work properly with authentication, the Ozone S3 Gateway needs to be configured to either accept these credentials or operate in anonymous mode. Without proper credential configuration, there may be authentication mismatches when Spark tries to access the S3 endpoint with the provided credentials.

Suggested change
-  <<: *ozone-common-config
+  <<: *ozone-common-config
+  - AWS_ACCESS_KEY_ID=admin
+  - AWS_SECRET_ACCESS_KEY=password
+  - AWS_REGION=us-east-1

Comment on lines +109 to 121
ozone-om:
<<: *ozone-image
ports:
- 9001:9001
- 9000:9000
command: ["server", "/data", "--console-address", ":9001"]
mc:
depends_on:
- minio
image: minio/mc
container_name: mc
- 9874:9874
environment:
<<: *ozone-common-config
CORE-SITE.XML_hadoop.proxyuser.hadoop.hosts: "*"
CORE-SITE.XML_hadoop.proxyuser.hadoop.groups: "*"
ENSURE_OM_INITIALIZED: /data/metadata/om/current/VERSION
WAITFOR: ozone-scm:9876
command: ["ozone", "om"]
networks:
iceberg_net:

Copilot AI Jan 12, 2026


The container name for the ozone-om service is not explicitly set, while other services in the docker-compose file (like spark-iceberg and iceberg-rest) have container_name specified. For consistency and easier debugging, consider adding explicit container names for all Ozone services, such as 'container_name: ozone-om' for the ozone-om service.
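
A sketch of the naming this comment suggests, shown for one service; the same pattern would apply to the other Ozone services:

  ozone-om:
    <<: *ozone-image
    container_name: ozone-om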

CORE-SITE.XML_hadoop.proxyuser.hadoop.hosts: "*"
CORE-SITE.XML_hadoop.proxyuser.hadoop.groups: "*"
ENSURE_OM_INITIALIZED: /data/metadata/om/current/VERSION
WAITFOR: ozone-scm:9876

Copilot AI Jan 12, 2026


The WAITFOR configuration on line 118 specifies 'ozone-scm:9876' to wait for the SCM service, but there's no corresponding WAITFOR configuration in the ozone-scm service to wait for its own initialization. The ozone-scm service is the first in the dependency chain but doesn't have any startup coordination. Consider verifying that the apache/ozone image handles the ENSURE_SCM_INITIALIZED check properly without needing an explicit WAITFOR, or document this dependency ordering requirement.
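
One way to make the ordering explicit is a Compose healthcheck on ozone-scm plus a conditional depends_on, sketched below. This assumes bash is available in the apache/ozone image and that 9876 is the SCM RPC port exposed in this file; it is an alternative to, not part of, the image's WAITFOR mechanism:

  ozone-scm:
    <<: *ozone-image
    environment:
      <<: *ozone-common-config
      ENSURE_SCM_INITIALIZED: /data/metadata/scm/current/VERSION
    healthcheck:
      # Treat the SCM as healthy once its RPC port accepts connections.
      test: ["CMD-SHELL", "bash -c 'exec 3<>/dev/tcp/localhost/9876'"]
      interval: 5s
      timeout: 3s
      retries: 30

  ozone-om:
    <<: *ozone-image
    depends_on:
      ozone-scm:
        condition: service_healthy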

Comment on lines +99 to +142
ozone-datanode:
<<: *ozone-image
ports:
- 9864
command: ["ozone", "datanode"]
environment:
- MINIO_ROOT_USER=admin
- MINIO_ROOT_PASSWORD=password
- MINIO_DOMAIN=minio
<<: *ozone-common-config
networks:
iceberg_net:
aliases:
- warehouse.minio

ozone-om:
<<: *ozone-image
ports:
- 9001:9001
- 9000:9000
command: ["server", "/data", "--console-address", ":9001"]
mc:
depends_on:
- minio
image: minio/mc
container_name: mc
- 9874:9874
environment:
<<: *ozone-common-config
CORE-SITE.XML_hadoop.proxyuser.hadoop.hosts: "*"
CORE-SITE.XML_hadoop.proxyuser.hadoop.groups: "*"
ENSURE_OM_INITIALIZED: /data/metadata/om/current/VERSION
WAITFOR: ozone-scm:9876
command: ["ozone", "om"]
networks:
iceberg_net:

ozone-scm:
<<: *ozone-image
ports:
- 9876:9876
environment:
- AWS_ACCESS_KEY_ID=admin
- AWS_SECRET_ACCESS_KEY=password
- AWS_REGION=us-east-1
entrypoint: |
/bin/sh -c "
until (/usr/bin/mc alias set minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;
/usr/bin/mc rm -r --force minio/warehouse;
/usr/bin/mc mb minio/warehouse;
/usr/bin/mc policy set public minio/warehouse;
tail -f /dev/null
"
<<: *ozone-common-config
ENSURE_SCM_INITIALIZED: /data/metadata/scm/current/VERSION
command: ["ozone", "scm"]
networks:
iceberg_net:

ozone-recon:
<<: *ozone-image
ports:
- 9888:9888
environment:
<<: *ozone-common-config
command: ["ozone", "recon"]
networks:
iceberg_net:

Copilot AI Jan 12, 2026


The Ozone services (ozone-datanode, ozone-om, ozone-scm, ozone-recon) store data in directories like /data/hdds and /data/metadata, but these directories are not mounted to host volumes. This means that when the containers are stopped or removed, all data will be lost. While this might be acceptable for a quickstart/demo environment, consider documenting this behavior or adding optional volume mounts for users who want to persist data across container restarts.
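
If persistence matters to a user, an optional named volume along these lines could be documented; the /data path comes from the ENSURE_*_INITIALIZED settings above, and the volume name here is hypothetical:

  ozone-om:
    <<: *ozone-image
    volumes:
      # Optional: keep OM metadata across container restarts (the quickstart default keeps data inside the container).
      - ozone-om-data:/data

volumes:
  ozone-om-data: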

container_name: mc
- 9874:9874
environment:
<<: *ozone-common-config

Copilot AI Jan 12, 2026


The proxy user configuration allows any host and any group to proxy as the hadoop user, which is overly permissive for a production environment. While this may be acceptable for a quickstart/demo environment, consider adding a comment in the documentation to warn users not to use these settings in production, or at minimum document that these are demo-only settings for ease of use.

Suggested change
-  <<: *ozone-common-config
+  <<: *ozone-common-config
+  # WARNING: The following proxyuser settings are for quickstart/demo use only.
+  # Do not use this configuration as-is in production; restrict hosts and groups appropriately.

s3g_pid=$$!
until ozone sh volume list >/dev/null 2>&1; do echo '...waiting...' && sleep 1; done;
ozone sh bucket delete /s3v/warehouse || true
ozone sh bucket create /s3v/warehouse

Copilot AI Jan 12, 2026


The script starts the S3 gateway in the background and then performs administrative operations. However, there's no check to ensure the S3 gateway itself is fully operational before proceeding with bucket operations. The wait only checks if ozone-om responds to 'volume list', but doesn't verify that the S3 gateway is ready. This could lead to race conditions where bucket operations might fail if the S3 gateway hasn't fully initialized. Consider adding a health check or retry logic for the bucket operations.

Suggested change
-  ozone sh bucket create /s3v/warehouse
+  max_retries=30
+  retry_delay=2
+  attempt=0
+  until ozone sh bucket create /s3v/warehouse >/dev/null 2>&1; do
+    attempt=$$((attempt + 1))
+    if [ "$$attempt" -ge "$$max_retries" ]; then
+      echo "Failed to create bucket /s3v/warehouse after $$max_retries attempts"
+      exit 1
+    fi
+    echo '...waiting for S3 gateway to be ready for bucket operations...'
+    sleep "$$retry_delay"
+  done

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Feb 12, 2026
